Abstract - In psychology, facial expressions are used to analyze the behavioral aspects of a subject and to reveal suppressed feelings such as anger, sadness, and happiness. These expressions have proven useful in identifying the mood and real intentions of the subject, but their main weakness is that they can be faked, which leads to misjudging the subject. Recent studies show that there are leaked micro-expressions which occur for a very short duration, i.e. 1/25 to 1/3 of a second, and which cannot be controlled and thus cannot be faked. These micro-expressions are nearly impossible to detect with the naked eye without special training. The proposed system detects these facial micro-expressions, which help to reveal the true feelings of the subject. The method is capable of spotting both macro-expressions, which are typically associated with emotions such as happiness, sadness, anger, disgust, and surprise, and rapid micro-expressions, which are typically, but not always, associated with semi-suppressed macro-expressions.

Index terms - facial micro-expression, deception detection, 
suppressed emotion detection. 

I. INTRODUCTION 

The human brain is capable of recognizing facial expressions, which provide a rich source of important affective information. After thirty years of research on micro-expressions by Ekman, Frank and O'Sullivan [1] and an independent group led by Porter [2], micro-expressions were found to be an important behavioral source for detecting deception and dangerous demeanor [1]. Facial micro-expressions are brief, involuntary expressions shown on the human face when people are trying to hide or fake an emotion. They usually occur at high-stakes moments, where people have something to gain or lose [3], and give a brief glimpse of feelings that people experience but try to hide.

From a technical point of view, detecting facial micro-expressions is a hard task for traditional approaches. A micro-expression lasts 1/25 to 1/3 of a second and involves low-intensity muscle movement. Detecting these micro-expressions automatically requires the use of a high-speed camera, and only highly trained people are able to detect them with the naked eye.

Micro-expressions have a number of potential applications. Police can use them to detect abnormal behavior. In the medical field, doctors can detect micro-expressions showing suppressed emotions to know when additional reassurance is needed. In education, teachers can recognize students' unease and make the lecture more effective. Business negotiators can use micro-expressions to judge when to propose a suitable price.

The main difficulty in detecting facial micro-expressions lies in their short duration and involuntary occurrence. With a 25 fps camera, an occurrence spans only a very small number of frames. To obtain a high detection rate, the best practice is to use a high-frame-rate camera, such as a 100 fps or 200 fps camera.
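The effect of the frame rate can be made concrete with a small calculation. The sketch below is only illustrative: it assumes the 1/25 to 1/3 second duration range stated above and counts whole frames by rounding down; the function name is hypothetical.

```python
from fractions import Fraction
from math import floor

# Duration range of a micro-expression in seconds, as stated in the text.
SHORTEST = Fraction(1, 25)
LONGEST = Fraction(1, 3)

def frames_spanned(fps):
    """Return (min, max) whole frames covered by the shortest and longest micro-expression."""
    return floor(SHORTEST * fps), floor(LONGEST * fps)

for fps in (25, 100, 200):
    low, high = frames_spanned(fps)
    print(f"{fps} fps: {low} to {high} frames")
```

Exact fractions are used instead of floating point so that the boundary case (1/25 second at 25 fps) is counted as exactly one frame.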

This paper proposes a system for detecting facial micro-expressions, evaluated on the SMIC database. The system works in three steps: 1) video acquisition, 2) feature extraction, and 3) analysis of the extracted features.

II. RELATED WORK 

In facial data extraction and representation for expression analysis, two main approaches exist: geometric feature-based methods and appearance-based methods. A review can be found in Bhall et al. [4].

The geometric facial features are represented by the shape and location of facial components (such as the mouth, eyes, eyebrows, and nose). The facial components and facial feature points are extracted by computer vision techniques to form a feature vector that represents the face geometry.

Strong results were reported for the Active Appearance Model (AAM) by Kanade's group [5]. However, AAM has two disadvantages. First, it requires an extensive dataset with a large number of manually tagged facial points. Second, the accuracy of facial feature tracking decreases significantly on faces that were not included in the training set.

Another approach is based on direct tracking of 20 facial feature points (e.g. eye and mouth corners, eyebrow edges) by a particle filter [6]. This approach delivers good results for some facial motions, but fails to detect subtle motions that can be detected only by observing the skin surface. The performance of this and similar approaches relies strongly on the accuracy of facial feature point tracking. In practice, facial feature point tracking algorithms cannot deliver the accuracy necessary for the micro-expression recognition task.

In appearance-based methods, image filters such as Gabor wavelets are applied to either the entire face or specific regions of the face to extract a feature vector. This method has been applied to spontaneous facial motion analysis and is considered the most popular [7]. However, it analyzes the video frame by frame, without considering the correlation between frames. In addition, applying this approach to facial surface analysis requires large datasets for training an enormous number of filters.

Spatio-temporal strain: This method [8] performs automatic spotting (temporal segmentation) of facial expressions in long videos comprising macro- and micro-expressions. It uses the strain imposed on the facial skin by the non-rigid motion that occurs during expressions. The strain magnitude is calculated using the central difference method over a robust, dense optical flow field observed in several regions (chin, mouth, cheek, and forehead) of each subject's face. This approach is able to detect and distinguish between large expressions (macro) and rapid, localized expressions (micro).
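The strain computation described above can be sketched as follows. This is a minimal illustration, not the implementation of [8]: it assumes the dense optical flow is already available as two arrays, and it assumes the strain magnitude is taken as the norm of the symmetric strain tensor. The function name is hypothetical.

```python
import numpy as np

def strain_magnitude(u, v):
    """Per-pixel strain magnitude from a dense flow field (u, v).

    u, v: 2-D arrays holding the horizontal and vertical flow components.
    np.gradient uses central differences at interior pixels.
    """
    du_dy, du_dx = np.gradient(u)   # axis 0 is y (rows), axis 1 is x (columns)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                    # normal strain along x
    e_yy = dv_dy                    # normal strain along y
    e_xy = 0.5 * (du_dy + dv_dx)    # shear strain (symmetric part)
    # Assumed magnitude: Frobenius norm of the 2x2 symmetric strain tensor.
    return np.sqrt(e_xx**2 + e_yy**2 + 2.0 * e_xy**2)
```

A rigid translation or a small rigid rotation of the face produces zero strain under this definition, which is why strain is a useful cue for non-rigid skin deformation rather than head motion.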


III. METHOD

3.1 Flowchart for Micro-Expression detection and analysis

3.2 Algorithm

MICRO-DETECT (C)

I. Detect face; if no face is found, exit.

II. Initialise block sizes $\Gamma = \{8{\times}8{\times}1,\ 5{\times}5{\times}1,\ 8{\times}8{\times}2,\ 5{\times}5{\times}2\}$ and frame counts $T = \{10, 15, 20, 30\}$.

III. Convert video into frames.

IV. For all frames $\rho_{i,1} \ldots \rho_{i,s}$ (1 to number of frames):

    i. In the first frame $\rho_{i,1}$, detect face $F_i$.

    ii. Locate and extract facial feature points using ASM: $\psi = \{(a_1, b_1) \ldots (a_h, b_h)\}$.

    iii. Compare and normalise the face with the model face by calculating the LWM transformation $\zeta = \mathrm{LWM}(\psi, \omega, \rho_{i,1})$, where $\omega$ is the feature point matrix for the model face and $\rho_{i,1}$ is the first frame.

    iv. Apply transformation $\zeta$ to all frames $\rho_{i,2}$ to $\rho_{i,s}$.

    v. Find eyes in each frame and crop the face using ASM.

    vi. For each $\theta \in T$ compute the temporally interpolated image sequence $\xi_{i,\theta} = U M F^n(t) + \bar{\xi}_i$.

    vii. For all $p \in \Gamma$, $\theta \in T$ extract the set of SLTDs $\mu_{i,p,\theta}(\xi_{i,\theta}) = \{q_{i,p,\theta,1} \ldots q_{i,p,\theta,M}\}$ with SLTD feature vector length $M$.

V. Evaluate kernels $K = \{\forall j,k,m,\theta,p.\ c_j \in C \wedge c_k \in C \wedge m = 1 \ldots M \wedge \theta \in T \wedge p \in \Gamma \wedge r = (m,\theta,p) \mid \mathrm{HISINT}(q_{j,r}, q_{k,r}),\ \mathrm{POLY}(q_{j,r}, q_{k,r}, 2),\ \mathrm{POLY}(q_{j,r}, q_{k,r}, 6)\}$.

VI. Micro = MKL-PHASE1(K)

$C$ is the input data, i.e. the video recording of the subject's facial movements. $\Gamma$ is the SLTD parameter set, where $x{\times}y{\times}t$ are the numbers of row, column and temporal blocks into which the features are divided. $T$ is the set of frame counts to which each image sequence $c_i$ is temporally interpolated. $\mathrm{LWM}(\psi, \omega, \rho)$ evaluates the local weighted mean transformation for frame $\rho$ using the feature points $\psi$ and the model face feature points $\omega$, as described in Section 3.3. $\mathrm{POLY}(q_{j,r}, q_{k,r}, d)$ evaluates the polynomial kernel of degree $d$ and $\mathrm{HISINT}(q_{j,r}, q_{k,r})$ the histogram intersection kernel, as in Eq. 2 and Eq. 3. MKL-PHASE1(K) gives the output, i.e. whether a facial micro-expression is detected.

3.3 Facial Feature Marking 

To handle the large variations in the spatial appearance of facial micro-expressions, the face geometry is cropped and normalised according to the eye positions from a Haar eye detector and the feature points located by an Active Shape Model (ASM) [9] deformation. ASMs are statistical models of the shape of an object that are iteratively deformed to fit an example of the object. The search starts from an average shape aligned to the location and size of the face given by a face detector and iterates the following two steps until convergence: a tentative shape is suggested by template matching of the image texture around the located points to change the feature point locations, and the tentative shape is then fitted to the global shape model. Using the 68 ASM feature points shown in Figure 1, we evaluate a Local Weighted Mean (LWM) [10] transformation of frame $\rho_{i,1}$ for sequence $i$. LWM evaluates the weighted mean of all polynomials passing over each point by setting the value of an arbitrary point $(x, y)$ to

$$f(x, y) = \frac{\sum_{i=1}^{N} V\left(\sqrt{(x - x_i)^2 + (y - y_i)^2} / R_n\right) S_i(x, y)}{\sum_{i=1}^{N} V\left(\sqrt{(x - x_i)^2 + (y - y_i)^2} / R_n\right)}$$

Eq. 1

where $S_i(x, y)$ is the polynomial with $n$ parameters passing through a measurement for control point $(x_i, y_i)$ and the $n - 1$ other measurements nearest to it, $V$ is the weight, $N$ is the number of control points, and $R_n$ is the distance of $(x_i, y_i)$ from its $(n - 1)$th nearest control point in the reference image. We then apply the transformation to $\rho_{i,2} \ldots \rho_{i,s}$ for an expression with $s$ frames. Figure 1 illustrates the LWM transformation of the facial feature points in an example face compared to a model face. The Haar eye detection results were checked against the ASM feature points, and these points were used to crop the image. Spatiotemporal local texture descriptors (SLTD) are then applied to the video for feature extraction.
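To make Eq. 1 concrete, the weighted mean can be evaluated as in the sketch below. Two things here are assumptions rather than part of the text: the weight $V$ is taken to be a smooth function that equals 1 at distance 0 and falls to 0 at distance $R_n$, and the local polynomials $S_i$ are passed in as ready-made callables instead of being fitted. All names are hypothetical.

```python
from math import hypot

def weight(r):
    """Assumed weight V: smooth, 1 at r = 0 and 0 for r >= 1."""
    return 1.0 - 3.0 * r**2 + 2.0 * r**3 if 0.0 <= r < 1.0 else 0.0

def lwm_value(x, y, controls):
    """Evaluate Eq. 1 at (x, y).

    controls: list of (x_i, y_i, R_n, S_i), where S_i is a callable polynomial
    and R_n is the radius of influence of control point i.
    """
    num = 0.0
    den = 0.0
    for x_i, y_i, r_n, s_i in controls:
        w = weight(hypot(x - x_i, y - y_i) / r_n)
        num += w * s_i(x, y)
        den += w
    if den == 0.0:
        raise ValueError("point lies outside the influence of every control point")
    return num / den
```

For example, with two control points at (0, 0) and (1, 0) whose polynomials are the constants 1 and 3, the midpoint (0.5, 0) receives equal weights and evaluates to 2.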





Fig 1: Facial feature point detection using a model face: (a) the model or example face, (b) the facial points detected in the frame, (c) the face after feature point detection.

All micro-expressions are further normalised temporally to a given set of frame counts $\theta \in T$. For every micro-expression image sequence $i$ we evaluate a temporally interpolated image sequence $\xi_{i,\theta} = U M F^n(t) + \bar{\xi}_i$ for all $\theta \in T$, where $U$ is the singular value decomposition matrix, $M$ is a square matrix, $F^n(t)$ is a curve and $\bar{\xi}_i$ is a mean vector.

SLTD (Spatiotemporal Local Texture Descriptors) are then applied to the video for feature extraction. SLTDs require the input video to have a minimum length. In this case LBP-TOP [11] is used with radius R=3 and the block sizes given in the algorithm. These parameters require the first and last 3 frames to be removed, because the descriptor cannot be placed there. To enable extraction of at least 1 frame for a segment, at least 7 frames of data are therefore needed. With a 25 fps camera, micro-expressions lasting 1/25 to 1/3 of a second span only 1 to 8 frames. A way of deriving more frames is therefore needed to apply SLTDs to any but the longest micro-expressions. Moreover, a higher number of frames is expected to give more statistically stable histograms. This is demonstrated in the next section.
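The frame budget in this paragraph follows from simple arithmetic, sketched below. The function name is hypothetical; the only inputs taken from the text are the radius R=3 and the frame counts used for interpolation.

```python
def usable_frames(num_frames, radius=3):
    """Number of frames on which a descriptor with temporal radius `radius` can be centred."""
    return max(0, num_frames - 2 * radius)

# A raw 25 fps clip of a micro-expression has at most 8 frames.
print(usable_frames(8))                                   # frames left after trimming
# After temporal interpolation to the frame counts used in the algorithm.
print([usable_frames(theta) for theta in (10, 15, 20, 30)])
```

With 7 input frames exactly one frame remains, which matches the minimum stated above, and interpolating to more frames increases the number of positions that contribute to the histogram.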

To improve the classification results, Multiple Kernel Learning (MKL) [12] is used. Given a training set $H = \{(x_1, l_1) \ldots (x_n, l_n)\}$ and a set of kernels $\{K_1 \ldots K_M\}$, where $K_k \in \mathbb{R}^{n \times n}$ and $K_k$ is positive semi-definite, MKL learns weights for linear/non-linear combinations of kernels over different domains by optimising a cost function $Z(K, H)$, where $K$ is a combination of basic kernels. As shown in the algorithm in Section 3.2, we combine polynomial kernels POLY of degrees 2 and 6 and a histogram-intersection kernel HISINT with different SLTD parameters $p \in \Gamma$ over different temporal interpolations $\theta \in T$, where

$$\mathrm{POLY}(q_{j,r}, q_{k,r}, d) = (1 + q_{j,r}\, q_{k,r}^{T})^{d}$$

Eq. 2

$$\mathrm{HISINT}(q_{j,r}, q_{k,r}) = \sum_{a=1}^{b} \min\{q_{j,r}^{a}, q_{k,r}^{a}\}$$

Eq. 3

and $r = (m, \theta, p)$ and $b$ is the number of bins in $q_{j,r}$, $q_{k,r}$. Random Forest and SVM are used as alternative classifiers. Our classification system currently implements a single phase: MKL-PHASE1(K) detects the occurrence of a facial micro-expression. The second phase, which classifies the micro-expression, is part of the future scope of our project.
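The two base kernels of Eq. 2 and Eq. 3 are simple to state in code. The sketch below is illustrative only: it treats each SLTD feature vector as a plain list of histogram bin values and does not include the MKL weight learning itself. Function names are hypothetical.

```python
def poly_kernel(q_j, q_k, d):
    """Polynomial kernel of Eq. 2: (1 + <q_j, q_k>) ** d."""
    return (1.0 + sum(a * b for a, b in zip(q_j, q_k))) ** d

def hisint_kernel(q_j, q_k):
    """Histogram intersection kernel of Eq. 3: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(q_j, q_k))

def gram_matrix(features, kernel):
    """Kernel (Gram) matrix over a list of feature vectors."""
    return [[kernel(u, v) for v in features] for u in features]
```

One Gram matrix per kernel and per parameter combination $r$ would then form the set $K$ that is handed to the MKL stage.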

If MKL-PHASE1(K) = micro, the second phase classifies the facial micro-expression into an arbitrary set of classes $L = \{l_1 \ldots l_n\}$. Dividing the task into two pipelined phases enables us to

1) optimise the phases separately, and

2) tailor $L$ for phase II to a given application while retaining the original optimised phase I.

Furthermore, labelling data for phase II requires a deeper analysis, which is subject to many labelling errors. By separating the two phases, we prevent the subjective labelling of expressions (phase II) from affecting the whole detection process.


3.4 Analysis for Expression Detection 

In this step the features extracted from each frame are compared using the Temporal Interpolation Method. This method was previously proposed by Zhou et al. [11] for synthesising the movements of a talking mouth. It allows a sufficient number of frames to be fed to the feature descriptor even for the shortest expressions with the smallest number of frames, and it also enables good extraction results as the number of frames used for extraction increases.

3.5 Temporal Interpolation Method

The video of a micro-expression is viewed as a sequence of images sampled along a curve [13], creating a continuous function in a low-dimensional manifold by representing the micro-expression video as a path graph $P_n$ with $n$ vertices, as in Figure 2. Vertices correspond to video frames and edges are given by the adjacency matrix $W$ with $W_{i,j} = 1$ if $|i - j| = 1$ and 0 otherwise.



Fig 2: (a) Graph representation of a facial micro-expression, (b) the video mapped along a curve by the temporal interpolation method.

To embed the manifold in the graph, $P_n$ is mapped to a line that minimises the distance between connected vertices. Let $y = (y_1, y_2, \ldots, y_n)^T$ be the map. We minimise

$$\sum_{i,j} (y_i - y_j)^2 W_{ij}, \quad i, j = 1, 2, \ldots, n$$

Eq. 4

to obtain $y$, which is equivalent to calculating the eigenvectors of the graph Laplacian of $P_n$. The graph Laplacian is computed such that it has eigenvectors $\{y_1, y_2, \ldots, y_{n-1}\}$, which enables us to view $y_k$ as a set of points described by

$$f_k^n(t) = \sin\left(\pi k t + \pi (n - k) / (2n)\right), \quad t \in [1/n, 1]$$

Eq. 5


Sampled at $t = 1/n, 2/n, \ldots, 1$, the resulting curve

$$F^n(t) = \left[ f_1^n(t),\ f_2^n(t),\ \ldots,\ f_{n-1}^n(t) \right]^T$$

Eq. 6

can be used to temporally interpolate the images at arbitrary positions within a micro-expression.

Fig 3: Temporal interpolation. The vertical temporal patterns.
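As a quick numerical check of Eq. 5, the sampled vectors $y_k$ can be verified to be eigenvectors of the path-graph Laplacian. The sketch below assumes the unnormalised Laplacian $L = D - W$ and the eigenvalues $2 - 2\cos(\pi k / n)$ that are standard for a path graph; neither is stated in the text, and the function names are hypothetical.

```python
from math import sin, cos, pi

def path_laplacian(n):
    """Graph Laplacian L = D - W of the path graph P_n (W_ij = 1 iff |i - j| = 1)."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                lap[i][j] = -1.0
                lap[i][i] += 1.0
    return lap

def y_k(n, k):
    """Eq. 5 sampled at t = 1/n, 2/n, ..., 1."""
    return [sin(pi * k * (i / n) + pi * (n - k) / (2 * n)) for i in range(1, n + 1)]

def eigen_residual(n, k):
    """Largest entry of |L y_k - lambda_k y_k| with lambda_k = 2 - 2 cos(pi k / n)."""
    lap, y = path_laplacian(n), y_k(n, k)
    lam = 2.0 - 2.0 * cos(pi * k / n)
    return max(abs(sum(lap[i][j] * y[j] for j in range(n)) - lam * y[i])
               for i in range(n))
```

A residual close to zero for every $k = 1, \ldots, n-1$ confirms that the closed form in Eq. 5 describes the eigenvectors, so no eigen-decomposition has to be computed at run time.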

To find the correspondence for the curve $F^n$ within the image space, the image frames are mapped to the points defined by $F^n(1/n), F^n(2/n), \ldots, F^n(1)$, and the linear extension of graph embedding is used to learn a transformation vector $w$ that minimises

$$\sum_{i,j} (w^T x_i - w^T x_j)^2 W_{ij}, \quad i, j = 1, 2, \ldots, n$$

Eq. 7

where $x_i = \xi_i - \bar{\xi}$ is a mean-removed vector and $\xi_i$ is the vectorised image. He et al. solved the resulting eigenvalue problem

$$X L X^T w = \lambda' X X^T w$$

Eq. 8

by using the singular value decomposition $X = U \Sigma V^T$. Zhou et al. showed that a new image $\xi$ can then be interpolated by

$$\xi = U M F^n(t) + \bar{\xi}$$

Eq. 9

where $M$ is a square matrix. The validity of this interpolation depends on assuming that the $\xi_i$ are linearly independent; this assumption held for the SMIC database.
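The interpolation of Eq. 9 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the text does not say how $M$ is obtained, so here $M$ is chosen such that the curve reproduces the original frames at the sample points $t = i/n$, and the new frames are taken at evenly spaced positions in $[1/n, 1]$. Function names are hypothetical.

```python
import numpy as np

def curve(n, t):
    """F^n(t) of Eq. 5 and Eq. 6, evaluated at each value in t; shape (n - 1, len(t))."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    k = np.arange(1, n)[:, None]
    return np.sin(np.pi * k * t[None, :] + np.pi * (n - k) / (2 * n))

def interpolate_sequence(frames, theta):
    """Resample n vectorised frames (rows of `frames`) to theta frames using Eq. 9."""
    frames = np.asarray(frames, dtype=float)
    n = frames.shape[0]
    mean = frames.mean(axis=0)
    X = (frames - mean).T                          # columns are mean-removed frames x_i
    U = np.linalg.svd(X, full_matrices=False)[0][:, : n - 1]
    Y = curve(n, np.arange(1, n + 1) / n)          # curve at the original sample points
    # Assumption: M is chosen so that U M F^n(i/n) reproduces x_i.
    M = U.T @ X @ np.linalg.pinv(Y)
    t_new = np.linspace(1.0 / n, 1.0, theta)
    return (U @ M @ curve(n, t_new)).T + mean
```

Calling `interpolate_sequence(frames, theta)` for each frame count in the set $T$ would give the fixed-length sequences that the descriptor stage expects. The sketch relies on the frames being linearly independent and on each frame having at least $n - 1$ pixels.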

We then compute the temporally interpolated image sequences $\xi_{i,\theta} = U M F^n(t) + \bar{\xi}_i$ for all $\theta \in T$, $c_i \in C$, compute all possible combinations of them with the different SLTD block parameters $\Gamma$, and choose the number of frames $\theta \in T$ and parameters $p \in \Gamma$ that maximise the accuracy for a given $C$.
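The parameter search described above amounts to an exhaustive search over all pairs of block size and frame count. In the sketch below, `evaluate` is a hypothetical callable standing in for whatever accuracy measure is used; only the two parameter sets come from the algorithm.

```python
from itertools import product

GAMMA = [(8, 8, 1), (5, 5, 1), (8, 8, 2), (5, 5, 2)]  # SLTD block sizes (rows, cols, temporal)
T = [10, 15, 20, 30]                                   # interpolated frame counts

def best_combination(evaluate):
    """Return the (p, theta) pair with the highest score.

    `evaluate` is a hypothetical callable mapping (p, theta) to an accuracy.
    """
    return max(product(GAMMA, T), key=lambda pt: evaluate(*pt))
```

With four block sizes and four frame counts, sixteen combinations are evaluated in total.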




IV. RESULT 


4.1 Dataset 

SMIC Database: The database [14] was recorded in an indoor bunker environment designed to resemble an interrogation room. Indoor illumination was kept stable throughout the whole recording period with four lights at the four upper corners of the room. 16 carefully selected movie clips, which can induce strong emotions, were shown to participants on a computer monitor together with a speaker for audio output. Participants sat about 50 cm from the computer monitor. While participants were watching the film clips, a camera fixed on top of the monitor recorded their facial reactions. The setup is illustrated in Figure 1. 20 participants took part in the recording experiment.

For the recording of the first ten participants, a high-speed (HS) camera (PixeLINK PL-B774U, 640x480) running at 100 fps was used to capture the short duration of micro-expressions. In addition to the high-speed camera, an integrated camera box was added, consisting of a normal visual camera (VIS) and a near-infrared (NIR) camera, both with 25 fps and a resolution of 640x480. The VIS and NIR cameras were added for two reasons: first, to improve the diversity of the database; second, to investigate whether the current method can also be used with normal-speed cameras of 25 fps. In contrast to a down-sampled version of the 100 fps data, the 25 fps data is similar to that of standard web cameras, including limitations such as motion blur. When multiple cameras were used, they were placed parallel to each other and fixed at the middle top of the monitor to ensure frontal-view recording. Due to technical issues there was a time delay of about 3-5 seconds between the starting points of the three cameras. VIS and NIR clips were manually synchronized with reference to the HS data.

4.2 Implementation 

The proposed algorithm is evaluated on the database using a leave-one-subject-out scheme. The first frame of every video in the dataset is taken as the baseline frame for that detection process.
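A leave-one-subject-out split can be sketched as follows. The representation of the data as (subject, clip) pairs is an assumption made for illustration, and the function name is hypothetical.

```python
def leave_one_subject_out(clips):
    """Yield (held_out_subject, train, test) splits.

    clips: list of (subject_id, clip) pairs; the pairing format is an assumption.
    """
    subjects = sorted({subject for subject, _ in clips})
    for held_out in subjects:
        train = [clip for subject, clip in clips if subject != held_out]
        test = [clip for subject, clip in clips if subject == held_out]
        yield held_out, train, test
```

Each subject's clips are used for testing exactly once, so no subject appears in both the training and the test set of the same split.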

The final SMIC database contains 164 micro-expression video clips from 16 participants, recorded at the HS (high-speed) camera rate. The frame distribution of the recorded video clips is shown in Figure 4.



Fig 4: The frame distribution for the video recordings.

The SLTD (Spatiotemporal Local Texture Descriptors) used here is LBP-TOP. For MKL (Multiple Kernel Learning), the block sizes given in the algorithm are used. Non-MKL classification results are listed with SLTD 8*8*1, where the image is divided into 8*8 blocks in the spatial domain. The proposed system is tested on the SMIC database. The input video is in .AVI format. Each video is a short recording of a subject responding to an emotion that the subject deliberately suppresses. The objective of the system is to detect facial micro-expressions.

The output of the system is a grayscale image showing the frame of the video that has the highest peak value for micro-expression detection, as shown in Figure 5.



Fig 5: A) Neutral face frame, B) detected frame and C) 
difference between A and B. 

A is the first frame of the video, which shows a neutral face and is taken as the baseline face for the face-to-model-face mapping through which the facial features are extracted using ASM (Active Shape Model). B is the detected frame containing the micro-expression. Because it is hard to see the difference between the two frames, the third image shows the difference between A and B.

The graphical representation of the outputs is shown in Figure 6.

Fig 6: Graphical representation for D) Neutral face frame, E) Detected frame and F) difference between D and E.

As the graph shows, the difference between the two frames has the highest peak value at the detected micro-expression.
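The selection of the output frame can be illustrated with the sketch below. It is not the system's actual scoring: it assumes the frames are already aligned grayscale arrays and that the "peak value" is the mean absolute difference to the first (neutral) frame. The function name is hypothetical.

```python
import numpy as np

def peak_frame(frames):
    """Return (index, difference image) of the frame that deviates most from frame 0.

    frames: array of shape (num_frames, height, width) holding grayscale images.
    The first frame is treated as the neutral baseline.
    """
    frames = np.asarray(frames, dtype=float)
    diffs = np.abs(frames - frames[0])            # per-pixel difference to the baseline
    scores = diffs.reshape(len(frames), -1).mean(axis=1)
    index = int(np.argmax(scores))
    return index, diffs[index]
```

The returned difference image corresponds to panel C of Figure 5, and the per-frame scores correspond to the kind of curve plotted in Figure 6.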